Import libraries

Check for missing values
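A minimal sketch of this check, assuming the data is loaded into a pandas DataFrame. The frame `df`, its column names, and its values below are a toy stand-in, not the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the spam dataset; column names and values are illustrative only.
df = pd.DataFrame({
    "word_freq_you": [1.2, 0.0, np.nan, 3.1],
    "word_freq_george": [0.0, 2.5, 1.0, 0.0],
    "spam": [1, 0, 0, 1],
})

# Count missing values per column; all zeros would mean nothing is missing.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame.
has_missing = bool(df.isnull().values.any())
print("Any missing values:", has_missing)
```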

Some correlations between the main class (spam, !spam) and the first 20 variables
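One possible way to compute these correlations is sketched here with `DataFrame.corrwith`. It assumes the class label is stored in a column named `spam` (1 = spam, 0 = !spam); the label name, the feature names, and the toy values are assumptions for illustration.

```python
import pandas as pd

# Toy data; the real dataset has many more rows and columns.
df = pd.DataFrame({
    "word_freq_you": [2.0, 0.1, 0.3, 3.0, 0.0, 2.5],
    "word_freq_george": [0.0, 1.5, 2.0, 0.0, 1.0, 0.0],
    "word_freq_your": [1.0, 0.0, 0.2, 1.5, 0.1, 0.9],
    "spam": [1, 0, 0, 1, 0, 1],
})

# Take at most the first 20 feature columns (the slice clips if there are fewer).
features = df.drop(columns="spam").iloc[:, :20]

# Pearson correlation of each feature with the class label, strongest first.
correlations = features.corrwith(df["spam"]).sort_values(ascending=False)
print(correlations)
```

A positive value means the word tends to appear more in spam, a negative value more in !spam.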

Average word frequency in spam vs !spam
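A rough sketch of this comparison: group the rows by class and take the mean of each word-frequency column. The `spam` label column, the `word_freq_*` names, and the toy values are assumptions, not the actual data.

```python
import pandas as pd

# Toy data; values are made up to illustrate the computation only.
df = pd.DataFrame({
    "word_freq_you": [2.0, 0.1, 0.3, 3.0],
    "word_freq_your": [1.0, 0.0, 0.2, 1.4],
    "word_freq_george": [0.0, 1.5, 2.5, 0.0],
    "word_freq_hp": [0.0, 1.0, 2.0, 0.0],
    "spam": [1, 0, 0, 1],
})

# Mean of every word-frequency column within each class (0 = !spam, 1 = spam).
mean_by_class = df.groupby("spam").mean()

# Spam mean minus !spam mean: positive values lean spam, negative lean !spam.
difference = (mean_by_class.loc[1] - mean_by_class.loc[0]).sort_values()

print(mean_by_class.T)
print(difference)
```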

The words "you" and "your" are far more frequent in spam emails than in !spam.

The words "free", "george" and "hp" are far more frequent in !spam emails than in spam.

From the dataset description we know that the word "george" (and "650" as well, though it is not significant here) is an indicator of !spam. So the spammer doesn't know the victim's name and addresses them as "you" instead of by their real name.

Spam emails have:

Capital letters are used far more in spam emails: the more frequent they are, the more likely the email is spam.

So the spammer prefers capital letters to focus the victim's attention on specific words, to scare them, and to rush them into clicking the fake link in the email.
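To make the capital-letter signal concrete, here is a small sketch that measures runs of consecutive capital letters in raw text. The helper `capital_run_stats` and both sample messages are hypothetical, and the dataset's own definition of a capital run may differ in detail.

```python
import re

def capital_run_stats(text):
    """Return (average, longest, total) length of uninterrupted A-Z runs in text."""
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        return 0.0, 0, 0
    return sum(runs) / len(runs), max(runs), sum(runs)

# Hypothetical messages, written only to contrast the two styles.
spam_like = "URGENT: your account is LOCKED, click NOW"
ham_like = "Hi George, the meeting moved to room 650."

print(capital_run_stats(spam_like))
print(capital_run_stats(ham_like))
```

Long runs push all three numbers up, which matches the observation that heavy capitalisation goes together with spam.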